GH-48868: [Doc] Document security model for the Arrow formats#48870
GH-48868: [Doc] Document security model for the Arrow formats#48870pitrou merged 2 commits intoapache:mainfrom
Conversation
|
@github-actions crossbow submit preview-docs |
|
Revision: 593babb Submitted crossbow builds: ursacomputing/crossbow @ actions-4f7018459b
|
raboof
left a comment
There was a problem hiding this comment.
Looks reasonable (without any particular Arrow expertise)
(noticed two typo's)
docs/source/format/Security.rst
Outdated
| uninitialized in a buffer if the array might be sent to, or read by, a untrusted | ||
| third-party, even when the uninitialized data is logically irrelevant. The | ||
| easiest way to do this, though perhaps not the most efficient, is to zero-initialize | ||
| any buffer that will not be populated in full. |
There was a problem hiding this comment.
Worth pointing out something about query engines and dataframe libraries deciding to not do so for internal/intermediate values in computations but applying a canonicalization pass when data leaves the system.
There was a problem hiding this comment.
Perhaps we can emphasize that all bytes in an Arrow array, regardless if they are "reachable", are readable by other libraries and users. Thus they should contain no potentially sensitive data (like uninitialized values).
And therefore, if query engines choose to use uninitialized memory internally as an optimization, they should ensure all such uninitialized values are cleared before passing the Arrays to another system
| from an untrusted source (for example because you are writing a proxy to | ||
| an arbitrary third-party service), it is **recommended** that you validate | ||
| the data first, as the consumer may assume that the data is valid already. | ||
|
|
There was a problem hiding this comment.
| In addition to invalid pointers, some array types have offsets, sizes, and buffer indices that might be out-of-bounds. The library producing arrays through the | |
| C data interface might be performing only very light validation of these values. |
docs/source/format/Security.rst
Outdated
| Advice for users | ||
| '''''''''''''''' | ||
|
|
||
| If you receive Arrow in-memory data from an untrusted source, it is |
There was a problem hiding this comment.
I suggest we also make the point about performance here to give context about why
validation is not always performed
Perhaps something like this:
"Arrow implementations often assume Arrays follow the specification
to provide high speed processing. It is extremely important that
your application either trusts or validates the Arrays it receives from
other sources.
Many Arrow implementations provide APIs to do such validation.
In terms of APIs, the Rust implementation always validates data from external sources, unless the validation is explicitly turned off with APIs marked as unsafe (a special Rust keyword).
docs/source/format/Security.rst
Outdated
| uninitialized in a buffer if the array might be sent to, or read by, a untrusted | ||
| third-party, even when the uninitialized data is logically irrelevant. The | ||
| easiest way to do this, though perhaps not the most efficient, is to zero-initialize | ||
| any buffer that will not be populated in full. |
There was a problem hiding this comment.
Perhaps we can emphasize that all bytes in an Arrow array, regardless if they are "reachable", are readable by other libraries and users. Thus they should contain no potentially sensitive data (like uninitialized values).
And therefore, if query engines choose to use uninitialized memory internally as an optimization, they should ensure all such uninitialized values are cleared before passing the Arrays to another system
docs/source/format/Security.rst
Outdated
| '''''''''''''''' | ||
|
|
||
| If you produce a C Data Interface structure for data that nevertheless comes | ||
| from an untrusted source (for example because you are writing a proxy to |
There was a problem hiding this comment.
I don't think this is any different than the other APIs -- basically "if you don't trust the producer source, you should always explicitly validate the arrays before processing them"
This doesn't seem any different for the C Data Interface than for the other APIs (like IPC files. etc)
There was a problem hiding this comment.
Hmm, that's true. I might just remove if it muddies the message.
| a trusted producer, for the reason explained above. However, it is still **recommended** | ||
| that you validate it for soundness, as a trusted producer can have bugs anyway. | ||
|
|
||
| IPC Format |
There was a problem hiding this comment.
As above, I think we could combine this into the section about validating data from untrusted sources, and give C Data Interface and IPC Format as examples of potentially untrusted sources.
There was a problem hiding this comment.
"This" means the IPC format section?
There was a problem hiding this comment.
Yes -- I was thinking that if the guidance is the same for IPC and C Data Interface, calling them out separately makes this documentation more verbose than it could be
The high level overview in my mind is:
- If you read Arrow data (in any format) from an untrusted source, it can be a memory and security concern. Thus you should always verify such data.
- Implementers should provide a way to let users opt in / out of the verification depending on the source and their security threat model.
If there are specific things to suggest remembering to check that is specific for specific formats such as the FlagBuffers offsets for IPC based formats, then we could call them out in those sections.
I am happy to propose some new wording if you like
There was a problem hiding this comment.
Well, the guidance is not the same. For the C Data Interface, it is simply impossible (to my knowledge) to protect against rogue raw pointers. For IPC, validation is possible to ensure safe operation.
There was a problem hiding this comment.
Yes, there's no way to meaningfully validate an ArrowArray other than perhaps to check pointers for unexpected NULL. This is because the ArrowArray does not provide buffer lengths (the consumer has to infer those based on the ArrowSchema and/or the buffer data of previous buffers). This is roughly the same as receiving an Arrow C++ or arrow-rs array from another library: the consumer has to assume it was correctly produced to avoid a crash and a malicious consumer can always attempt to read byte ranges it isn't supposed to based on the buffer pointers).
59c98d4 to
40c916b
Compare
|
I've addressed most review comments and expanded the document quite a bit:
Another round of reviewing is welcome! |
|
@github-actions crossbow submit preview-docs |
|
Revision: 40c916b Submitted crossbow builds: ursacomputing/crossbow @ actions-c3adcc06fb
|
40c916b to
068cc96
Compare
|
Rendered document preview at https://s3.amazonaws.com/arrow-data/pr_docs/48870/format/Security.html |
068cc96 to
29ccd3b
Compare
|
@github-actions crossbow submit preview-docs |
|
@github-actions crossbow submit preview-docs |
|
Revision: 131d70f Submitted crossbow builds: ursacomputing/crossbow @ actions-4dc8dcf3d1
|
131d70f to
f269af7
Compare
|
It looks the In any case, do reviewers agree that this is good enough to go, or do you think there should be further changes? |
alamb
left a comment
There was a problem hiding this comment.
Thank you @pitrou and other reviewers
I think this document is a great addition to the Arrow documentation and will be a great reference to refer people to
I left some small wording suggestions, but I don't think any of them are required to merge this PR.
docs/source/format/Security.rst
Outdated
| 1. *users* of Arrow: that is, developers of third-party libraries or applications | ||
| that use some of the Arrow formats or protocols by calling into Arrow libraries | ||
| as defined below; | ||
|
|
||
| 2. *implementors* of Arrow libraries: that is, libraries that provide APIs | ||
| abstraction away from the details of the Arrow formats and protocols; such | ||
| libraries include the official Arrow implementations documented on | ||
| https://arrow.apache.org, but not only. |
There was a problem hiding this comment.
| 1. *users* of Arrow: that is, developers of third-party libraries or applications | |
| that use some of the Arrow formats or protocols by calling into Arrow libraries | |
| as defined below; | |
| 2. *implementors* of Arrow libraries: that is, libraries that provide APIs | |
| abstraction away from the details of the Arrow formats and protocols; such | |
| libraries include the official Arrow implementations documented on | |
| https://arrow.apache.org, but not only. | |
| 1. *users* of Arrow: that is, developers of third-party libraries or applications | |
| that consume data using Arrow formats or protocols created by another application. | |
| 2. *implementors* of Arrow libraries: that is, libraries that provide APIs | |
| abstraction away from the details of the Arrow formats and protocols; such | |
| libraries include, but are not limited to, the official Arrow implementations documented on | |
| https://arrow.apache.org. |
There was a problem hiding this comment.
Hmm, that changes the meaning quite a bit. I wanted to stress that "users" don't directly implement the Arrow specs, they use language-specific abstraction layers provided by an implementation. Perhaps I should just use these words :)
There was a problem hiding this comment.
I've tried to improve the wording a bit.
There was a problem hiding this comment.
The new wording looks good to me 👍
- users of Arrow: that is, developers of third-party libraries or applications
that don't implement directly implement the Arrow formats or protocols, but
instead call language-specific APIs provided by an Arrow library
(as defined below);
| .. TODO: | ||
| For each layout, we should list the associated security risks and the recommended | ||
| steps to validate (perhaps in Columnar.rst) |
There was a problem hiding this comment.
It seems like links to the invalid fuzz data has been added above
I personally think this PR is already hugely valuable, even without such a list.
Thus I suggest we merge this PR and get it published. We can file an issue to track adding type specific security risks as a follow on
| A typical validation API must return a well-defined error, not crash, if the | ||
| given Arrow data is invalid; it must always be safe to execute regardless of | ||
| whether the data is valid or not. | ||
|
|
There was a problem hiding this comment.
BTW the way the Rust API works is that by default, data read from IPC is explicitly validated (which is indeed quite expensive)
It is possible to turn this validation off via an unsafe API (the unsafe bit means the calling application has to explicitly disable the validation to acknowledge they are trusting the source):
| Advice for users | ||
| ---------------- | ||
|
|
||
| You should **never** consume a C Data Interface structure from an untrusted |
ebbe049 to
61a3e81
Compare
|
@github-actions crossbow submit preview-docs |
|
Revision: 61a3e81 Submitted crossbow builds: ursacomputing/crossbow @ actions-19f42b185d
|
paleolimbot
left a comment
There was a problem hiding this comment.
Thank you for these updates and putting this document together!
|
Thanks a lot for the reviews! |
Rationale for this change
Accessing Arrow data or any of the formats can have non-trivial security implications, this is an attempt at documenting those.
What changes are included in this PR?
Add a Security Considerations page in the Format section.
Doc preview: https://s3.amazonaws.com/arrow-data/pr_docs/48870/format/Security.html
Are these changes tested?
N/A
Are there any user-facing changes?
No.